Single-Channel Multi-talker Speech Recognition with Permutation Invariant Training

Authors

  • Yanmin Qian
  • Xuankai Chang
  • Dong Yu
Abstract

Although great progress has been made in automatic speech recognition (ASR), significant performance degradation is still observed when recognizing multi-talker mixed speech. In this paper, we propose and evaluate several architectures to address this problem under the assumption that only a single channel of mixed signal is available. Our technique extends permutation invariant training (PIT) by introducing a front-end feature separation module with the minimum mean square error (MSE) criterion and a back-end recognition module with the minimum cross entropy (CE) criterion. More specifically, during training we compute the average MSE or CE over the whole utterance for each possible utterance-level output-target assignment, pick the one with the minimum MSE or CE, and optimize for that assignment. This strategy elegantly solves the label permutation problem observed in deep learning based multi-talker mixed speech separation and recognition systems. The proposed architectures are evaluated and compared on an artificially mixed AMI dataset with both two- and three-talker mixed speech. The experimental results indicate that our proposed architectures can reduce the word error rate (WER) by 45.0% and 25.0% relative to the state-of-the-art single-talker speech recognition system across all speakers when their energies are comparable, for two- and three-talker mixed speech, respectively. To our knowledge, this is the first work on multi-talker mixed speech recognition on the challenging speaker-independent spontaneous large vocabulary continuous speech task.

Keywords: permutation invariant training, multi-talker mixed speech recognition, feature separation, joint-optimization
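The assignment step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes plain Python lists in place of network outputs, uses MSE only, and the names `utterance_mse` and `pit_mse` are hypothetical. It shows only how the utterance-level assignment with the lowest average error is selected; in the paper, the selected assignment is then used to optimize the model.

```python
from itertools import permutations
from typing import Sequence, Tuple

# One stream is a sequence of frames; each frame is a sequence of feature values.
Frames = Sequence[Sequence[float]]


def utterance_mse(output: Frames, target: Frames) -> float:
    """Squared error averaged over every frame and feature of one stream."""
    total = 0.0
    count = 0
    for out_frame, tgt_frame in zip(output, target):
        for o, t in zip(out_frame, tgt_frame):
            total += (o - t) ** 2
            count += 1
    return total / count


def pit_mse(outputs: Sequence[Frames],
            targets: Sequence[Frames]) -> Tuple[float, Tuple[int, ...]]:
    """Return the minimum utterance-level MSE and the assignment achieving it.

    assignment[i] is the index of the target paired with output stream i.
    The same assignment is held fixed for the whole utterance.
    """
    best_loss = float("inf")
    best_assignment: Tuple[int, ...] = ()
    for assignment in permutations(range(len(targets))):
        # Average the per-stream utterance MSE under this candidate assignment.
        loss = sum(
            utterance_mse(outputs[i], targets[j])
            for i, j in enumerate(assignment)
        ) / len(outputs)
        if loss < best_loss:
            best_loss, best_assignment = loss, assignment
    return best_loss, best_assignment


# Toy data (assumed): two talkers, two frames each, one feature per frame.
targets = [[[0.0], [0.0]], [[1.0], [1.0]]]
# The output streams are close to the targets but in swapped order.
outputs = [[[0.9], [1.1]], [[0.1], [-0.1]]]

loss, assignment = pit_mse(outputs, targets)
print(assignment, loss)
```

In this toy case the swapped assignment is selected, so output stream 0 is scored against target 1 and vice versa. Enumerating all assignments costs factorial time in the number of talkers, which stays small for the two- and three-talker cases considered here.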

Similar resources

Recognizing Multi-Talker Speech with Permutation Invariant Training

In this paper, we propose a novel technique for direct recognition of multiple speech streams given the single channel of mixed speech, without first separating them. Our technique is based on permutation invariant training (PIT) for automatic speech recognition (ASR). In PIT-ASR, we compute the average cross entropy (CE) over all frames in the whole utterance for each possible output-target as...

Monaural speech separation and recognition challenge

Robust speech recognition in everyday conditions requires the solution to a number of challenging problems, not least the ability to handle multiple sound sources. The specific case of speech recognition in the presence of a competing talker has been studied for several decades, resulting in a number of quite distinct algorithmic solutions whose focus ranges from modeling both target and compet...

Microphone-array speech recognition via incremental map training

For a hidden Markov model (HMM) based speech recognition system it is desirable to combine enhancement of the acoustical signal and statistical representation of model parameters, ensuring both a high quality speech signal and an appropriately trained HMM. In this paper the incremental variant of maximum a posteriori (MAP) estimation is used to adjust the parameters of a talker-independent HM...

Temporal cues for consonant recognition: training, talker generalization, and use in evaluation of cochlear implants.

Limited consonant phonemic information can be conveyed by the temporal characteristics of speech. In the two experiments reported here, the effects of practice and of multiple talkers on identification of temporal consonant information were evaluated. Naturally produced /aCa/ disyllables were used to create "temporal-only" stimuli having instantaneous amplitudes identical to the natural speech s...

Super-human multi-talker speech recognition: the IBM 2006 speech separation challenge system

We describe a system for model based speech separation which achieves super-human recognition performance when two talkers speak at similar levels. The system can separate the speech of two speakers from a single channel recording with remarkable results. It incorporates a novel method for performing two-talker speaker identification and gain estimation. We extend the method of model based high...

Journal title:
  • CoRR

Volume abs/1707.06527  Issue 

Pages  -

Publication date 2017